A common strategy to deal with these issues is to build repeated models (weak learners) on the same data and combine them into a single result.
These are called ensemble or consensus estimators/predictors.
As a general rule, ensemble learners tend to improve the results obtained with the weak learners they are made of.
Ensembles can be built on different learners, but we will focus on those built on trees:
Two questions arise here:
The bootstrap has been applied to almost any problem in Statistics.
We begin with the easiest and best known case: estimating the standard error (that is the square root of the variance) of an estimator.
\[\begin{eqnarray*} \theta &=& E_F(X)=\theta (F) \\ \theta &=& Med(X)=\{m:P_F(X\leq m)=1/2\}= \theta (F). \end{eqnarray*}\]
\[\begin{eqnarray*} \hat{\theta}&=&\overline{X}=\int XdF_n(x)=\frac 1n\sum_{i=1}^nx_i=\theta (F_n) \\ \hat{\theta}&=&\widehat{Med}(X)=\{m:\frac{\#x_i\leq m}n=1/2\}=\theta (F_n) \end{eqnarray*}\]
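As a quick illustration with a made-up sample (the values below are ours, not the text's), the plug-in estimates \(\theta(F_n)\) for the mean and the median are just the sample mean and sample median:

```r
# Plug-in principle: theta(F_n) is the same functional applied to F_n,
# i.e. to the empirical distribution of the sample.
x <- c(3, 7, 1, 9, 5)        # a toy sample (values chosen for illustration)
theta_mean <- mean(x)        # theta(F_n) for the mean: 5
theta_med  <- median(x)      # theta(F_n) for the median: 5
```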
An important question when computing an estimator \(\hat \theta\) of a parameter \(\theta\) is: how precise is \(\hat \theta\) as an estimator of \(\theta\)?
\[ \sigma _{\overline{X}}=\frac{\sigma (X)}{\sqrt{n}}=\frac{\sqrt{\int [x-\int x\,dF(x)]^2\,dF(x)}}{\sqrt{n}}=\sigma _{\overline{X}}(F) \]
then, the standard error estimator is the same functional applied on \(F_n\), that is:
\[ \hat{\sigma}_{\overline{X}}=\frac{\hat{\sigma}(X)}{\sqrt{n}}=\frac{\sqrt{1/n\sum_{i=1}^n(x_i-\overline{x})^2}}{\sqrt{n}}=\sigma _{\overline{X}}(F_n). \]
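A minimal sketch of this plug-in standard-error estimate of the mean, on simulated data (the seed and sample are ours):

```r
set.seed(123)
x <- rnorm(50, mean = 10, sd = 2)   # a sample from F
n <- length(x)

# Plug-in standard deviation sigma(F_n): divides by n, not n - 1
sigma_hat <- sqrt(mean((x - mean(x))^2))
se_mean   <- sigma_hat / sqrt(n)    # sigma_Xbar(F_n)
```

Note that R's `sd()` divides by \(n-1\), so `se_mean` differs slightly from `sd(x)/sqrt(n)`.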
The bootstrap method makes it possible to do the desired approximation: \[\hat{\sigma}_{\hat\theta} \simeq \sigma _{\hat\theta}(F_n)\] without having to know the form of \(\sigma_{\hat\theta}(F)\).
To do this, the bootstrap estimates, or directly approximates, \(\sigma_{\hat{\theta}}(F_n)\) from the sample.
The bootstrap allows us to estimate the standard error from samples of \(F_n\), that is,
the substitution of \(F\) by \(F_n\) is carried out in the sampling step.
\[\begin{eqnarray*} &&\mbox{Instead of: } \\ && \quad F\stackrel{s.r.s}{\longrightarrow }{\bf X} = (X_1,X_2,\dots, X_n) \, \quad (\hat \sigma_{\hat\theta} =\underbrace{\sigma_{\hat\theta}(F_n)}_{unknown}) \\ && \mbox{It is done: } \\ && \quad F_n\stackrel{s.r.s}{\longrightarrow }\quad {\bf X^{*}}=(X_1^{*},X_2^{*}, \dots ,X_n^{*}) \quad (\hat \sigma_{\hat\theta}= \hat \sigma_{\hat \theta}^* \simeq \sigma_{\hat \theta}^*). \end{eqnarray*}\]
Here, \(\sigma_{\hat \theta}^*\) is the bootstrap standard error of \(\hat \theta\) and
\(\hat \sigma_{\hat \theta}^*\) the bootstrap estimate of the standard error of \(\hat \theta\).
That is, the new (re-)sampling process consists of extracting samples of size \(n\) of \(F_n\):
\({\bf X^{*}}=(X_1^{*},X_2^{*},\dots ,X_n^{*})\) is a random sample of size \(n\) obtained with replacement from the original sample \((X_1,X_2,\dots ,X_n)\).
Samples \({\bf X^*}\) obtained through this procedure are called bootstrap samples or re-samples.
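Drawing one bootstrap resample in R amounts to sampling with replacement from the original observations (the sample values below are made up for illustration):

```r
set.seed(1)
x <- c(4.2, 5.1, 3.8, 6.0, 4.9, 5.5)           # original sample (X_1, ..., X_n)
n <- length(x)
x_star <- sample(x, size = n, replace = TRUE)  # one bootstrap resample X*
# every element of x_star is one of the original observations,
# and some observations will typically appear more than once
```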
\[\begin{eqnarray*} \mathcal {L}(\hat \theta)&\simeq& P_F(\hat\theta \leq t): \mbox{Sampling distribution of } \hat \theta,\\ \mathcal {L}(\hat \theta^*)&\simeq& P_{F_n}(\hat\theta^* \leq t): \mbox{Bootstrap distribution of } \hat \theta, \end{eqnarray*}\]
This distribution is usually not known.
However, the sampling process and the calculation of the statistic can be approximated using a Monte Carlo algorithm.
\[ \mbox{if }B\rightarrow\infty \mbox{ then } \hat{\sigma}_B (\hat\theta) \rightarrow \hat\sigma_{\infty} (\hat\theta) =\sigma_B(\hat\theta)=\sigma_{\hat\theta}(F_n). \]
The bootstrap approximation, \(\hat{\sigma}_B(\hat\theta)\), to the bootstrap SE, \(\sigma_B(\hat\theta)\), provides an estimate of \(\sigma_{\hat\theta}(F_n)\):
\[ \hat{\sigma}_B(\hat\theta)(\simeq \sigma_B(\hat\theta)=\sigma_{\hat\theta}(F_n))\simeq\hat \sigma_{\hat\theta}(F_n). \]
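The whole Monte Carlo approximation \(\hat{\sigma}_B(\hat\theta)\) can be sketched in a few lines of base R, here for the median of a simulated sample (the seed, sample size, and \(B\) are arbitrary choices of ours):

```r
set.seed(42)
x <- rexp(40)              # the observed sample
B <- 2000                  # number of bootstrap resamples

# theta_hat* computed on each of the B resamples from F_n
theta_star <- replicate(B, median(sample(x, replace = TRUE)))

# Monte Carlo estimate of the bootstrap standard error
se_boot <- sd(theta_star)  # hat sigma_B(theta_hat)
```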
From real world to bootstrap world:
\[\hat f_{bag}(x)=\frac 1B \sum_{b=1}^B \hat f^{*b}(x) \]
\[ \hat G_{bag}(x) = \arg \max_k \left[\hat f_{bag}(x)\right]_k. \]
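The bagging average \(\hat f_{bag}\) can be sketched in base R; for simplicity we use `lm` as a stand-in for the tree learner \(\hat f^{*b}\) (all names and data below are ours, chosen for illustration):

```r
set.seed(7)
n  <- 100
df <- data.frame(x = runif(n))
df$y <- 2 * df$x + rnorm(n, sd = 0.3)    # simulated training data

B    <- 200
newx <- data.frame(x = c(0.25, 0.75))    # points at which to predict

preds <- sapply(seq_len(B), function(b) {
  idx <- sample(n, replace = TRUE)       # bootstrap resample of the rows
  fit <- lm(y ~ x, data = df[idx, ])     # hat f^{*b}, fitted on the resample
  predict(fit, newdata = newx)
})

f_bag <- rowMeans(preds)                 # hat f_bag(x): average over the B fits
```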
Since each out-of-bag set is not used to train the model, it can be used to evaluate performance.
Source: https://www.baeldung.com/cs/random-forests-out-of-bag-error
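Each bootstrap resample leaves out, on average, about \((1-1/n)^n \approx e^{-1} \approx 36.8\%\) of the observations, and those left-out observations form the out-of-bag set. A quick check in base R:

```r
set.seed(99)
n   <- 1000
idx <- sample(n, replace = TRUE)          # one bootstrap resample of indices
oob <- setdiff(seq_len(n), unique(idx))   # observations never drawn
length(oob) / n                           # close to exp(-1), about 0.368
```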
This example relies on the well-known AmesHousing dataset on house prices in Ames, IA.
We use libraries:
- rpart for stratified resampling
- ipred for bagging.

A complementary way to interpret a tree is to quantify how important each feature is.
This is done by measuring the total reduction in the loss function associated with each variable across all splits.
This measure can be extended to an ensemble simply by adding up variable importance over all trees built.
caret or the vip function from the vip package can be used (see lab examples).

Random Forests Algorithm, from chapter 17 in (Hastie and Efron 2016)
# number of features
n_features <- length(setdiff(names(ames_train), "Sale_Price"))
# train a default random forest model (requires the ranger package)
library(ranger)
ames_rf1 <- ranger(
Sale_Price ~ .,
data = ames_train,
mtry = floor(n_features / 3),
respect.unordered.factors = "order",
seed = 123
)
# get OOB RMSE
(default_rmse <- sqrt(ames_rf1$prediction.error))
## [1] 24859.27

There are several parameters that, appropriately tuned, can improve RF performance.
1 and 2 tend to have the largest impact on predictive accuracy.